Abstract

We consider a recursive algorithm to construct an aggregated esti- 
mator from a finite number of base decision rules in the classification 
problem. The estimator approximately minimizes a convex risk functional
under the $\ell_1$-constraint. It is defined by a stochastic version
of the mirror descent algorithm (i.e., of the method which performs
gradient descent in the dual space) with an additional averaging. The
main result of the paper is an upper bound for the expected accuracy
of the proposed estimator. This bound is of the order $\sqrt{(\log M)/t}$
with an explicit and small constant factor, where M is the dimension 
of the problem and t stands for the sample size. A similar bound 
is proved for a more general setting that covers, in particular, the 
regression model with squared loss. 

1 Introduction 

The methods of Support Vector Machines (SVM) and boosting have recently
become widely used in classification practice (see, e.g., [121 123 I2E1 135]).
These methods are based on minimization of a convex empirical risk func- 
tional with a penalty. Their statistical analysis is given, for instance, in 
[3J I20J 123 EH| where one can find further references. In these papers, the 
classifiers are analyzed as if they were exact minimizers of the empirical risk 
functional but in practice this is not necessarily the case. Moreover, it is 
assumed that the whole data sample is available, but often it is interesting 
to consider the online setting where the observations come one-by-one, and 
recursive methods need to be implemented. 

There exists an extensive literature on recursive classification starting 
from Perceptron and its various modifications (see, e.g., the monographs PUI21 
133] and the related references therein, as well as the overviews in [TUl HI]). We
mention here only the methods which use the same loss functions as boosting 
and SVM, and which may thus be viewed as their online analogues. Probably, 
the first technique of such kind is the method of potential functions, some 
versions of which can be considered as online analogues of SVM (see P3 12] 
and [TTj, Chapter 10). Recently, online analogues of SVM and boosting-type
methods using convex losses have been proposed in fOJEEl. In particular, in
[38], a stochastic gradient algorithm with averaging is studied for a general
class of loss functions (cf. [24]). All these papers use the standard stochastic
gradient method for which the descent takes place in the initial parameter 
space. 

In this paper, we also suggest online versions of boosting and SVM, but 
based on a different principle: the gradient descent is performed in the dual 
space. Algorithms of this kind are known as mirror descent methods [21],
and they were initially introduced for deterministic optimization problems.
Their advantage, as compared to the standard gradient methods, is that the 
convergence rate depends logarithmically on the dimension of the problem. 
Therefore, they turn out to be very efficient in high-dimensional problems.

Some versions of the original mirror descent method of Nemirovski and 
Yudin [21] were derived independently in the learning community and have 
been applied to classification and other learning problems in the papers 
[TH IT7| ITS] where bounds for the relative risk criterion were obtained. How- 
ever, these results are formulated in a deterministic setting and they do not 
extend straightforwardly to the standard stochastic analysis with a mean risk 
criterion (see [T71 |HJ |5] for insights on the connections between the two types 
of results). Below we propose a novel version of the mirror descent method 
which attains the optimal bounds of the mean risk accuracy. Its main dif- 
ference from the previous methods is the additional step of averaging of the 
updates. 

The goal of this paper is to construct an aggregated decision rule: we in- 
troduce a fixed and finite base class of decision functions, and we choose the 
weights in their convex or linear combination in an optimal way. The opti- 
mality of weights is understood in the sense of minimization of a convex risk 
function under the $\ell_1$-constraints on the weights. This aggregation problem
is similar to those considered, for instance, in jT3j and [31] for the regression 
model with squared loss. To solve the problem, we propose a recursive algo- 
rithm of mirror descent type with averaging of the updates. We prove that 
the algorithm converges with a rate of the order $\sqrt{(\log M)/t}$, where $M$ is the
dimension of the problem, and t stands for the sample size. 

The paper is organized as follows. First, we give the problem statement
and formulate the main result on the convergence rate (Section 2). Then, the
algorithm is described (Section 3) and the proof of the main result is given
(Section 4). In Section 5, the result is extended to general loss functions and
to general estimation problems. A discussion is given in Section 6.



2 Set-up and Main Result 

We consider the problem of binary classification. Let $(X, Y)$ be a pair of
random variables with values in $\mathcal{X} \times \{-1, +1\}$, where $\mathcal{X}$ is a feature space.
A decision rule $g_f : \mathcal{X} \to \{-1, +1\}$, corresponding to a measurable function
$f : \mathcal{X} \to \mathbb{R}$, is defined as $g_f(x) = 2 I_{[f(x) > 0]} - 1$, where $I_{[\cdot]}$ denotes the indicator
function. A standard measure of quality of a decision rule is its risk,
which equals the probability of misclassification $R(g_f) = P\{Y \neq g_f(X)\}$.
The optimal decision rule is defined as $g_{f^*}$, where $f^*$ is a minimizer of $R(g_f)$ over
all measurable $f$. The optimal rule is not implementable in practice since the
distribution of $(X, Y)$ is unknown. In order to approximate $g_{f^*}$, one looks for
empirical decision rules $\hat g_n$ based on a sample $(X_1, Y_1), \dots, (X_n, Y_n)$, where
$(X_i, Y_i)$ are independent random pairs having the same distribution as $(X, Y)$.

An abstract approach to construction of empirical decision rules 1351
EH1 prescribes to search for $\hat g_n$ in the form $\hat g_n = g_{\hat f_n}$, where $\hat f_n$ is a minimizer of
the empirical risk (empirical classification error):

$$ R_n(g_f) = \frac{1}{n} \sum_{i=1}^{n} I_{[Y_i \neq g_f(X_i)]} \tag{1} $$

over all $f$ from a given class of decision rules. Conditions of statistical
optimality of the method of minimization of the empirical classification error (1)
have been extensively studied (see in particular HU EH] ). However, this
method is not computationally tractable, since the risk functional $R_n$ in (1)
is not convex or even continuous. In practice, efficient methods like SVM
and boosting implement numerical minimization of convex empirical risk
functionals different from (1), as was first noticed in [7j and •. Theoretical
analysis was provided recently in several papers |3J 1201 E3 I2H], where
consistency and rates of convergence of convex risk minimization methods
are established in terms of the probability of misclassification.

A key argument used in these works is that, under rather general assumptions,
the optimal decision rule $g_{f^*}$ coincides with $g_{f_A}$, where $f_A$ is an optimal
decision function in the sense that it minimizes a convex risk functional called
the $\varphi$-risk and defined by

$$ A(f) = E\,\varphi(Y f(X)) , $$

where $\varphi : \mathbb{R} \to \mathbb{R}_+$ is a convex loss function and $E$ denotes the expectation.
Typical choices of loss functions are the hinge loss $\varphi(x) = (1 - x)_+$
(used in SVM), as well as the exponential and logit losses $\varphi(x) = \exp(-x)$
and $\varphi(x) = \log_2(1 + \exp(-x))$, respectively (used in boosting).
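The three losses named above, together with monotone versions of their derivatives, can be written down directly. The following is a minimal sketch; the function names are illustrative choices and do not come from the paper, and the hinge derivative uses the right-continuous version.

```python
import math

def hinge(x):
    """Hinge loss (1 - x)_+ used in SVM."""
    return max(0.0, 1.0 - x)

def hinge_deriv(x):
    """Right-continuous monotone version of the hinge derivative."""
    return -1.0 if x < 1.0 else 0.0

def exponential(x):
    """Exponential loss exp(-x) used in boosting."""
    return math.exp(-x)

def exponential_deriv(x):
    return -math.exp(-x)

def logit(x):
    """Logit loss log_2(1 + exp(-x)) used in boosting."""
    return math.log2(1.0 + math.exp(-x))

def logit_deriv(x):
    # d/dx log_2(1 + e^{-x}) = -1 / (ln 2 * (1 + e^{x}))
    return -1.0 / (math.log(2.0) * (1.0 + math.exp(x)))
```

All three losses are convex and non-increasing, so each derivative above is non-positive and non-decreasing in `x`.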

Thus, to find an empirical decision rule $\hat g_n$ which approximates the optimal
$g_{f^*}$, we can consider minimizing the empirical $\varphi$-risk

$$ A_n(f) = \frac{1}{n} \sum_{i=1}^{n} \varphi(Y_i f(X_i)) , $$

which is an unbiased estimate for $A(f)$. This strategy is further justified by
a result in [37] generalized in [3]. This minimization problem is simpler than
the original one because it can be solved by standard numerical procedures,
the functional $A_n$ being convex. When relevant penalty functions are added,
it leads to some versions of boosting and SVM algorithms. At the same time,
one needs the whole sample $(X_1, Y_1), \dots, (X_n, Y_n)$ for their implementation,
i.e., these are batch procedures.

In this paper, we consider the problem of minimization of the $\varphi$-risk $A$ on
a parametric class of functions $f$ when the data $(X_i, Y_i)$ come sequentially
(online setting).

Let us introduce the parametric class of functions in which $f$ is selected.
Suppose that a finite set of base functions $\{h_1, \dots, h_M\}$ is given, where
$h_j : \mathcal{X} \to [-K, K]$, $j = 1, \dots, M$, $K > 0$ is a constant, and $M \ge 2$. We denote
by $H$ the vector function whose components are the base functions:

$$ \forall x \in \mathcal{X}, \quad H(x) = (h_1(x), \dots, h_M(x))^T . \tag{2} $$

A typical example is the one where the functions $h_j$ are decision rules, i.e.,
they take values in $\{-1, 1\}$. Furthermore, for a fixed $\lambda > 0$ we denote the
$\lambda$-simplex in $\mathbb{R}^M$ by $\Theta_{M,\lambda}$:

$$ \Theta_{M,\lambda} = \Big\{ \theta = (\theta^{(1)}, \dots, \theta^{(M)})^T \in \mathbb{R}_+^M : \sum_{j=1}^{M} \theta^{(j)} = \lambda \Big\} . $$

Introduce a family of $\lambda$-convex combinations of the functions $h_1, \dots, h_M$:

$$ \mathcal{F}_{M,\lambda} = \{ f_\theta = \theta^T H : \theta \in \Theta_{M,\lambda} \} , $$

over which minimization of the $\varphi$-risk $A$ will be performed. The minimization
of $A(f)$ over all $f \in \mathcal{F}_{M,\lambda}$ is equivalent to the minimization of $A(f_\theta)$ over all
$\theta \in \Theta_{M,\lambda}$, so we simplify the notation and write in what follows

$$ A(\theta) \triangleq A(f_\theta) . $$

Define the vector of optimal weights of the $\lambda$-convex combination of the base
functions as a solution to the minimization problem

$$ \min_{\theta \in \Theta_{M,\lambda}} A(\theta) . \tag{3} $$

We assume that the distribution of $(X, Y)$ is unknown, hence the function
A is also unknown, and its direct minimization is impossible. However, we 
have access to a training sample of independent pairs (Xi,Yi), having the 
same distribution as (X, Y) that are delivered sequentially and may be used 
for estimation of the optimal weights. 

In the following section, we propose a stochastic algorithm based on the
mirror descent principle which, at the $t$-th iteration, yields the estimate
$\hat\theta_t = \hat\theta_t((X_1, Y_1), \dots, (X_{t-1}, Y_{t-1}))$ of the solution to the problem (3). The estimate
$\hat\theta_t$ is measurable with respect to $(\hat\theta_{t-1}, X_{t-1}, Y_{t-1})$, which means that the
algorithm fits the online setting. In order to obtain the updates of the
algorithm, it is sufficient to have random realizations of the sub-gradient of
$A$ which have the form:

$$ u_i(\theta) = \varphi'(Y_i \theta^T H(X_i))\, Y_i H(X_i) \in \mathbb{R}^M , \quad i = 1, 2, \dots, \tag{4} $$

where $\varphi'$ represents an arbitrary monotone version of the derivative of $\varphi$ (one
may take, for instance, the right-continuous version).
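The stochastic sub-gradient (4) needs only one observation, the current weights and the values of the base functions at that observation. A minimal sketch for the hinge loss follows; the function and argument names are hypothetical, and `h_x` is assumed to already hold the vector $H(X_i)$.

```python
def hinge_deriv(x):
    """Right-continuous derivative of the hinge loss (1 - x)_+."""
    return -1.0 if x < 1.0 else 0.0

def stochastic_subgradient(theta, h_x, y, loss_deriv=hinge_deriv):
    """Return u_i(theta) = phi'(y * theta^T H(x)) * y * H(x) as a list of length M."""
    margin = y * sum(t * h for t, h in zip(theta, h_x))
    scale = loss_deriv(margin) * y
    return [scale * h for h in h_x]
```

Any other monotone derivative (for instance of the exponential or logit loss) can be passed through `loss_deriv` without changing the rest of the computation.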

Given $\hat\theta_t$, the convex combination $\hat\theta_t^T H(\cdot)$ of the base functions can be
constructed, and it defines an aggregated decision rule

$$ \tilde g_t(x) = 2 I_{[\hat\theta_t^T H(x) > 0]} - 1 . $$

Statistical properties of this decision rule are described by the following
result, which establishes the convergence rate for the expected accuracy of the
estimator $\hat\theta_t$ with respect to the $\varphi$-risk.

Theorem 1 For a given convex loss function $\varphi$, for a fixed number $M \ge 2$
of base functions and a fixed value of $\lambda > 0$, let the estimate $\hat\theta_t$ be defined by
the algorithm of Subsection 3.4. Then, for any integer $t \ge 1$,

$$ E A(\hat\theta_t) - \min_{\theta \in \Theta_{M,\lambda}} A(\theta) \le C\, \frac{(\ln M)^{1/2} \sqrt{t+1}}{t} , \tag{5} $$

where $C = C(\varphi, \lambda) = 2 \lambda L_\varphi(\lambda)$ and $L_\varphi(\lambda) = K \sup_{x \in [-K\lambda, K\lambda]} |\varphi'(x)|$.

For example, Theorem 1 holds with constant $C = 2$ in a typical case
where we deal with convex ($\lambda = 1$) aggregation of base classifiers $h_j$ taking
values in $\{-1, 1\}$ and we use the hinge loss $\varphi(x) = (1 - x)_+$. We also note
that Theorem 1 is distribution free: there is no assumption on the joint
distribution of $X$ and $Y$ except, of course, that $Y$ takes values in $\{-1, 1\}$
since we deal with the classification problem.
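To get a feeling for the size of the right-hand side of (5), it can be evaluated numerically. The helper below is a small sketch; its name and default arguments are illustrative, with the defaults corresponding to the hinge-loss case with $\lambda = 1$ and $K = 1$ mentioned above.

```python
import math

def theorem1_bound(M, t, lam=1.0, l_phi=1.0):
    """Right-hand side of (5): C * sqrt(ln M) * sqrt(t + 1) / t with C = 2 * lam * l_phi."""
    C = 2.0 * lam * l_phi
    return C * math.sqrt(math.log(M)) * math.sqrt(t + 1) / t
```

For instance, squaring the number of base functions multiplies the bound only by $\sqrt{2}$, which is the logarithmic dependence on the dimension discussed in the remarks that follow.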
Remark 1 (efficiency.) The rate of convergence of order $\sqrt{(\ln M)/t}$ is
typical without low noise assumptions (as they are introduced in [32]). Batch
procedures based on minimization of the empirical convex risk functional
present a similar rate. From the statistical point of view, there is no
remarkable difference between batch procedures and our mirror descent procedure. On
the other hand, from the computational point of view, our procedure is quite
comparable with the direct stochastic gradient descent. However, the mirror
descent algorithm presents two major advantages as compared both to
batch and to direct stochastic gradient: (i) its behavior with respect to the
cardinality of the base class is better than for direct stochastic gradient
descent (of the order of $\sqrt{\ln M}$ in Theorem 1 instead of $M$ or $\sqrt{M}$ for direct
stochastic gradient); (ii) mirror descent presents a higher efficiency especially
in high-dimensional problems since its algorithmic complexity and memory
requirements are of strictly smaller order than for corresponding batch
procedures (see [13] for a comparison).

Remark 2 (optimality of the convergence rate.) Using the techniques
of fH] and [HI! it is not hard to prove the minimax lower bound
on the excess risk $E A(\hat\theta_t) - \min_{\theta \in \Theta_{M,\lambda}} A(\theta)$ having the order $\sqrt{(\ln M)/t}$ for
$M \ge t^{1/2+\delta}$ with some $\delta > 0$. This indicates that the upper bound of
Theorem 1 is rate optimal for such values of $M$.

Remark 3 (choice of the base class.) We point out that the good
behavior of this method crucially relies on the choice of the base class of
functions $\{h_j\}_{1 \le j \le M}$. A natural choice would be to consider a symmetric class
in the sense that if an element $h$ is in the class, then $-h$ is also in the class. For
a practical implementation, some initial data set should be available in order
to pre-select a set of $M$ functions (or classifiers) $h_j$. Another choice which is
practical and widespread is to choose very simple base elements $h_j$ such
as decision stumps; nevertheless, aggregation can lead to good performance
if their cardinality $M$ is very large. As far as theory is concerned, in order to
provide a complete statistical analysis, one should establish approximation
error bounds on the quantity $\inf_{f \in \mathcal{F}_{M,\lambda}} A(f) - \inf_f A(f)$, showing that the
richness of the base class is reflected both by the diversity (orthogonality or
independence) of the $h_j$'s and by its cardinality $M$. For example, one can
take the $h_j$'s as the eigenfunctions associated with some positive definite kernel. We
refer to [30] for related results, see also [29]. The choice of $\lambda$ can be motivated
by similar considerations. In fact, if the approximation error is to be taken
into account, it might be useful to take $\lambda$ depending on the sample size $t$ and
tending to infinity at some slow rate (cf. [20]). A balance between the
stochastic error as given in Theorem 1 and the approximation error would
then determine the optimal choice of $\lambda$. These considerations are left beyond
the scope of the paper, since we focus here on the aggregation problem.



3 Definition and Discussion of the Algorithm 

In this section, we introduce the proposed algorithm. It is based on the mirror 
descent idea going back to Nemirovski and Yudin [21] and is the stochastic 
counterpart of Nesterov's primal-dual subgradient method of deterministic 
convex optimization, studied in j22] and (22] • We first give some definitions 
and recall some facts from convex analysis. 

3.1 Proxy functions 

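We will denote by $E = \ell_1^M$ the space $\mathbb{R}^M$ equipped with the 1-norm

$$ \|z\|_1 = \sum_{j=1}^{M} |z^{(j)}| $$

and by $E^* = \ell_\infty^M$ the dual space, which is $\mathbb{R}^M$ equipped with the sup-norm

$$ \|z\|_\infty = \max_{\|\theta\|_1 = 1} z^T \theta = \max_{1 \le j \le M} |z^{(j)}| , \quad \forall z \in E^* , $$

with the notation $z = (z^{(1)}, \dots, z^{(M)})^T$.

Let $\Theta$ be a convex, closed set in $E$. For a given parameter $\beta > 0$ and a
convex function $V : \Theta \to \mathbb{R}$, we call the $\beta$-conjugate function of $V$ the
Legendre-Fenchel type transform of $\beta V$:

$$ \forall z \in E^*, \quad W_\beta(z) = \sup_{\theta \in \Theta} \{ -z^T \theta - \beta V(\theta) \} . \tag{6} $$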

Now we introduce the key assumption (Lipschitz condition in the dual norms
$\|\cdot\|_1$ and $\|\cdot\|_\infty$) that will be used in the proofs of Theorems 1 and 2.

Assumption (L). A convex function $V : \Theta \to \mathbb{R}$ is such that its $\beta$-conjugate
$W_\beta$ is continuously differentiable on $E^*$ and its gradient $\nabla W_\beta$ satisfies:

$$ \|\nabla W_\beta(z) - \nabla W_\beta(\tilde z)\|_1 \le \frac{1}{\alpha\beta} \|z - \tilde z\|_\infty , \quad \forall z, \tilde z \in E^*, \ \beta > 0 , $$

where $\alpha > 0$ is a constant independent of $\beta$.

This assumption is related to the notion of strong convexity w.r.t. the
$\|\cdot\|_1$-norm (see, e.g., [5] ESj).

Definition 1 Fix $\alpha > 0$. A convex function $V : \Theta \to \mathbb{R}$ is said to be
$\alpha$-strongly convex with respect to the norm $\|\cdot\|_1$ if

$$ V(sx + (1 - s)y) \le s V(x) + (1 - s) V(y) - \frac{\alpha}{2}\, s (1 - s) \|x - y\|_1^2 \tag{7} $$

for all $x, y \in \Theta$ and any $s \in [0, 1]$.

The following proposition sums up some properties of $\beta$-conjugates and,
in particular, yields a sufficient condition for Assumption (L).

Proposition 1 Consider a convex function $V : \Theta \to \mathbb{R}$ and a strictly
positive parameter $\beta$. Then, the $\beta$-conjugate $W_\beta$ of $V$ has the following
properties.

1. The function $W_\beta : E^* \to \mathbb{R}$ is convex and has a conjugate $\beta V$, i.e.

$$ \forall \theta \in \Theta, \quad \beta V(\theta) = \sup_{z \in E^*} \{ -z^T \theta - W_\beta(z) \} . $$

2. If $V$ is $\alpha$-strongly convex with respect to the norm $\|\cdot\|_1$ then

(i) Assumption (L) holds true,

(ii) $\arg\max_{\theta \in \Theta} \{ -z^T \theta - \beta V(\theta) \} = -\nabla W_\beta(z) \in \Theta$.

For a proof of this proposition we refer to |5J I2S1. Some elements of the
proof are given in the Appendix, subsection B.

Definition 2 We call a function $V : \Theta \to \mathbb{R}_+$ a proxy function if it is convex,
and

(i) there exists a point $\theta_* \in \Theta$ such that $\min_{\theta \in \Theta} V(\theta) = V(\theta_*)$,

(ii) Assumption (L) holds true.

Example. Consider the entropy type proxy function:

$$ \forall \theta \in \Theta_{M,\lambda}, \quad V(\theta) = \lambda \ln\Big(\frac{M}{\lambda}\Big) + \sum_{j=1}^{M} \theta^{(j)} \ln \theta^{(j)} \tag{8} $$

(where $0 \ln 0 = 0$), which has a single minimizer $\theta_* = (\lambda/M, \dots, \lambda/M)^T$ with
$V(\theta_*) = 0$. It is easy to check (see Appendix, subsection B) that this function
is $\alpha$-strongly convex with respect to the norm $\|\cdot\|_1$ with the parameter
$\alpha = 1/\lambda$. An important property of this choice of $V$ is that the optimization
problem (6) can be solved explicitly so that $W_\beta$ and $\nabla W_\beta$ are given by the
following formulas:

$$ \forall z \in E^*, \quad W_\beta(z) = \lambda \beta \ln\Big( \frac{1}{M} \sum_{k=1}^{M} e^{-z^{(k)}/\beta} \Big) , \tag{9} $$

$$ \frac{\partial W_\beta(z)}{\partial z^{(j)}} = -\lambda\, \frac{e^{-z^{(j)}/\beta}}{\sum_{k=1}^{M} e^{-z^{(k)}/\beta}} , \quad j = 1, \dots, M . \tag{10} $$

Assumption (L) for function (8) holds true, as can be easily proved by direct
calculations without resorting to Proposition 1 (see Appendix, subsection A).
Furthermore, note that for $\lambda = 1$ the following holds true:

• the entropy type proxy function as defined in (8) corresponds to the
Kullback information divergence between the uniform distribution on
the set $\{1, \dots, M\}$ and the distribution on the same set defined by
probabilities $\theta^{(j)}$, $j = 1, \dots, M$,

• in view of (10), the components of the vector $-\nabla W_\beta(z)$ define a Gibbs
distribution on the coordinates of $z$, with $\beta$ being interpreted as a
temperature parameter.
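The entropy type proxy function and its $\beta$-conjugate are easy to evaluate numerically. The sketch below assumes the closed forms $W_\beta(z) = \lambda\beta \ln\big(\tfrac{1}{M}\sum_k e^{-z^{(k)}/\beta}\big)$ and $-\nabla W_\beta(z) = $ Gibbs weights scaled by $\lambda$, which follow from solving (6) for the entropy (8) on the $\lambda$-simplex; the function names are illustrative, and the shift by `min(z)` is a numerical-stability device rather than part of the definition.

```python
import math

def entropy_proxy(theta, lam=1.0):
    """V(theta) = lam * ln(M / lam) + sum_j theta_j ln theta_j, with 0 ln 0 = 0."""
    M = len(theta)
    return lam * math.log(M / lam) + sum(t * math.log(t) for t in theta if t > 0)

def w_beta(z, beta, lam=1.0):
    """beta-conjugate of the entropy proxy on the lam-simplex."""
    M = len(z)
    shift = min(z)  # factor out exp(-shift / beta) so that no exponent is positive
    s = sum(math.exp(-(zk - shift) / beta) for zk in z)
    return lam * beta * math.log(s / M) - lam * shift

def mirror_map(z, beta, lam=1.0):
    """-grad W_beta(z): Gibbs weights on the coordinates of z, summing to lam."""
    shift = min(z)
    w = [math.exp(-(zk - shift) / beta) for zk in z]
    total = sum(w)
    return [lam * wk / total for wk in w]
```

A quick consistency check is that `w_beta(z, beta, lam)` coincides with $-z^T\theta - \beta V(\theta)$ evaluated at `theta = mirror_map(z, beta, lam)`, as item 2(ii) of Proposition 1 suggests.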
3.2 Algorithm

Mirror descent algorithms are optimization procedures achieving a stochastic
gradient descent in the dual space. The proposed algorithm presents two
modifications: first, it uses updates of the stochastic sub-gradient, and also
it presents an averaging step of the iterate outputs. At each iteration $i$, a
new data point $(X_i, Y_i)$ is observed and there are two updates:

• one is the variable $\zeta_i$, which is defined by the stochastic sub-gradients
$u_k(\theta_{k-1})$, $k = 1, \dots, i$, as the result of the descent in the dual space $E^*$,

• the other update is the parameter $\theta_i$, which is the "mirror image" of $\zeta_i$
in the initial space $E$.

In order to tune the algorithm properly, we will also need two fixed
positive sequences $(\gamma_i)_{i \ge 1}$ (step size) and $(\beta_i)_{i \ge 1}$ ("temperature") such that
$\beta_i \ge \beta_{i-1}$, $\forall i \ge 1$. The algorithm is defined as follows:

Fix the initial values $\theta_0 \in \Theta$ and $\zeta_0 = 0 \in \mathbb{R}^M$.
For $i = 1, \dots, t - 1$, do the recursive update

$$ \zeta_i = \zeta_{i-1} + \gamma_i u_i(\theta_{i-1}) , \qquad \theta_i = -\nabla W_{\beta_i}(\zeta_i) . \tag{11} $$

Output at iteration $t$ the following convex combination:

$$ \hat\theta_t = \frac{\sum_{i=1}^{t} \gamma_i \theta_{i-1}}{\sum_{i=1}^{t} \gamma_i} . \tag{12} $$

Note that if $V$ is the entropy type proxy function defined in (8), the
components $\theta_i^{(j)}$ of the vector $\theta_i$ from (11) have the form

$$ \theta_i^{(j)} = \frac{\lambda \exp\Big( -\beta_i^{-1} \sum_{m=1}^{i} \gamma_m u_{m,j}(\theta_{m-1}) \Big)}{\sum_{k=1}^{M} \exp\Big( -\beta_i^{-1} \sum_{m=1}^{i} \gamma_m u_{m,k}(\theta_{m-1}) \Big)} , \tag{13} $$

where $u_{m,j}(\theta)$ represents the $j$-th component of the vector $u_m(\theta)$, $j = 1, \dots, M$.



3.3 Heuristic considerations 

Suppose that we want to minimize a convex function $\theta \mapsto A(\theta)$ over a convex
set $\Theta$. If $\theta_0, \dots, \theta_{t-1}$ are the available search points at iteration $t$, we can
provide the affine approximations $\phi_i$ of the function $A$ defined, for $\theta \in \Theta$, by

$$ \phi_i(\theta) = A(\theta_{i-1}) + (\theta - \theta_{i-1})^T \nabla A(\theta_{i-1}) , \quad i = 1, \dots, t . $$

Here $\theta \mapsto \nabla A(\theta)$ is a vector function belonging to the sub-differential of $A(\cdot)$.
Taking a convex combination of the $\phi_i$'s, we obtain an averaged approximation
of $A(\theta)$:

$$ \bar\phi_t(\theta) = \frac{\sum_{i=1}^{t} \gamma_i \big( A(\theta_{i-1}) + (\theta - \theta_{i-1})^T \nabla A(\theta_{i-1}) \big)}{\sum_{i=1}^{t} \gamma_i} . $$

At first glance, it would seem reasonable to choose as the next search point
a vector $\theta \in \Theta$ minimizing the approximation $\bar\phi_t$, i.e.,

$$ \theta_t = \arg\min_{\theta \in \Theta} \bar\phi_t(\theta) = \arg\min_{\theta \in \Theta} \theta^T \Big( \sum_{i=1}^{t} \gamma_i \nabla A(\theta_{i-1}) \Big) . \tag{14} $$

However, this does not make any progress, because our approximation is
"good" only in the vicinity of the search points $\theta_0, \dots, \theta_{t-1}$. Therefore, it is
necessary to modify the criterion, for instance, by adding a special penalty
$B_t(\theta, \theta_{t-1})$ to the target function in order to keep the next search point $\theta_t$ in
the desired region. Thus, one chooses the point:

$$ \theta_t = \arg\min_{\theta \in \Theta} \Big[ \theta^T \Big( \sum_{i=1}^{t} \gamma_i \nabla A(\theta_{i-1}) \Big) + B_t(\theta, \theta_{t-1}) \Big] . \tag{15} $$

Our algorithm corresponds to a specific type of penalty $B_t(\theta, \theta_{t-1}) = \beta_t V(\theta)$,
where $V$ is the proxy function.

Also note that in our problem the vector-function $\nabla A(\cdot)$ is not available.
Therefore, we replace in (15) the unknown gradients $\nabla A(\theta_{i-1})$ by the observed
stochastic sub-gradients $u_i(\theta_{i-1})$. This yields a new definition of the
$t$-th search point:

$$ \theta_t = \arg\min_{\theta \in \Theta} \Big[ \theta^T \Big( \sum_{i=1}^{t} \gamma_i u_i(\theta_{i-1}) \Big) + \beta_t V(\theta) \Big] = \arg\max_{\theta \in \Theta} \big[ -\zeta_t^T \theta - \beta_t V(\theta) \big] , \tag{16} $$

where

$$ \zeta_t = \sum_{i=1}^{t} \gamma_i u_i(\theta_{i-1}) . $$

Observe that by Proposition 1 the solution to this problem reads as $-\nabla W_{\beta_t}(\zeta_t)$,
and it is now easy to deduce the iterative scheme (11).

3.4 A particular instance of the algorithm 

We now define the special case of the mirror descent method with averaging
for which Theorem 1 is proved. We consider the algorithm described in
Subsection 3.2 with the entropy type proxy function $V$ as defined in (8) and
with the following specific choice of the sequences $(\gamma_i)_{i \ge 1}$ and $(\beta_i)_{i \ge 1}$:

$$ \gamma_i \equiv 1 , \qquad \beta_i = \beta_0 \sqrt{i + 1} , \quad i = 1, 2, \dots, \tag{17} $$

where

$$ \beta_0 = L_\varphi(\lambda) (\ln M)^{-1/2} . \tag{18} $$

Thus, the algorithm becomes simpler and can be implemented in the following
recursive form:

$$ \zeta_i = \zeta_{i-1} + u_i(\theta_{i-1}) , \tag{19} $$

$$ \theta_i = -\nabla W_{\beta_i}(\zeta_i) , \tag{20} $$

$$ \hat\theta_i = \hat\theta_{i-1} - \frac{1}{i} \big( \hat\theta_{i-1} - \theta_{i-1} \big) , \quad i = 1, 2, \dots, \tag{21} $$

with initial values $\zeta_0 = 0$, $\theta_0 \in \Theta$ and $(\beta_i)_{i \ge 1}$ from (17), (18).
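The recursion (19)-(21) translates almost line by line into code. The following is a minimal sketch for the hinge loss and the entropy type proxy function, not a reference implementation: the function names are hypothetical, the starting point $\theta_0$ is taken to be the uniform vector $(\lambda/M, \dots, \lambda/M)$ (an assumption; the text only requires $\theta_0 \in \Theta$), and `l_phi` stands for the constant $L_\varphi(\lambda)$, which equals $K$ for the hinge loss.

```python
import math

def hinge_deriv(x):
    """Right-continuous derivative of the hinge loss (1 - x)_+."""
    return -1.0 if x < 1.0 else 0.0

def gibbs_weights(zeta, beta, lam):
    """theta = -grad W_beta(zeta) for the entropy proxy: Gibbs weights summing to lam."""
    shift = min(zeta)  # keeps every exponent <= 0
    w = [math.exp(-(z - shift) / beta) for z in zeta]
    total = sum(w)
    return [lam * wk / total for wk in w]

def mirror_descent_averaging(samples, base_functions, lam=1.0, l_phi=1.0,
                             loss_deriv=hinge_deriv):
    """Process n = len(samples) pairs (x, y) one by one and return hat-theta_{n+1}."""
    M = len(base_functions)                      # requires M >= 2 so that ln M > 0
    beta0 = l_phi / math.sqrt(math.log(M))       # (18)
    zeta = [0.0] * M                             # zeta_0 = 0
    theta = [lam / M] * M                        # theta_0 (assumed uniform)
    theta_hat = list(theta)                      # overwritten by theta_0 at i = 1
    i = 0
    for i, (x, y) in enumerate(samples, start=1):
        # (21): fold theta_{i-1} into the running average
        theta_hat = [th - (th - t) / i for th, t in zip(theta_hat, theta)]
        h = [hj(x) for hj in base_functions]
        margin = y * sum(t * hj for t, hj in zip(theta, h))
        scale = loss_deriv(margin) * y           # sub-gradient (4) is scale * H(x)
        zeta = [z + scale * hj for z, hj in zip(zeta, h)]            # (19)
        theta = gibbs_weights(zeta, beta0 * math.sqrt(i + 1), lam)   # (17), (20)
    # one more averaging step so that the last iterate theta_n is included
    n = i + 1
    return [th - (th - t) / n for th, t in zip(theta_hat, theta)]
```

Only the dual vector `zeta`, the current iterate and the running average are stored, so memory is of order $M$ regardless of the number of observations processed.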



3.5 Comparison to other Mirror Descent Methods 

The versions of the mirror descent method proposed in [21] are somewhat
different from our iterative scheme (11). One of them, which is the closest to
(11), is studied in detail in 0. It is based on the recursive relation

$$ \theta_i = -\nabla W_1\big( -\nabla V(\theta_{i-1}) + \gamma_i u_i(\theta_{i-1}) \big) , \quad i = 1, 2, \dots, \tag{22} $$

where the function $V$ is strongly convex with respect to the norm of the initial
space $E$ (which is not necessarily the space $\ell_1^M$) and $W_1$ is the 1-conjugate
function to $V$.

If $\Theta = \mathbb{R}^M$ and $V(\theta) = \frac{1}{2} \|\theta\|_2^2$, the scheme (22) coincides with the
ordinary gradient method.

For the unit simplex $\Theta = \Theta_{M,1}$ and the entropy type proxy function $V$
from (8), the components $\theta_i^{(j)}$ of the vector $\theta_i$ from (22) are:

$$ \forall j = 1, \dots, M, \quad \theta_i^{(j)} = \frac{\theta_{i-1}^{(j)} \exp\big( -\gamma_i u_{i,j}(\theta_{i-1}) \big)}{\sum_{k=1}^{M} \theta_{i-1}^{(k)} \exp\big( -\gamma_i u_{i,k}(\theta_{i-1}) \big)} = \frac{\theta_0^{(j)} \exp\Big( -\sum_{m=1}^{i} \gamma_m u_{m,j}(\theta_{m-1}) \Big)}{\sum_{k=1}^{M} \theta_0^{(k)} \exp\Big( -\sum_{m=1}^{i} \gamma_m u_{m,k}(\theta_{m-1}) \Big)} . \tag{23} $$

The algorithm (23) is also known as the exponentiated gradient (EG) method
[T7j . The differences between the algorithms (22), (23) and ours are the
following:

• the initial iterative scheme (11) is different from that of (22), (23); in
particular, it includes the second tuning parameter $\beta_i$; moreover, the
algorithm (23) uses the initial value $\theta_0$ in a different manner;

• along with (11), our algorithm contains the additional step of averaging
of the updates (12).

Papers jU [TH] study convergence properties of the EG method (23) in
a deterministic setting. Namely, they show that, under some assumptions,
the difference $A_t(\theta_t) - \min_{\theta \in \Theta_{M,1}} A_t(\theta)$ is bounded by a constant depending
on $M$ and $t$. If this constant is small enough, these results show that the
EG method provides good numerical minimizers of the empirical risk $A_t$.
However, they do not apply to the expected risk. In particular, they do not
imply that the expected risk $E A(\theta_t)$ is close to the minimal possible value
$\min_{\theta \in \Theta_{M,1}} A(\theta)$, which, as we prove, is true for the algorithm with averaging
proposed here.
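For comparison, one step of the EG recursion on the unit simplex is a multiplicative reweighting followed by normalization. The sketch below uses hypothetical names and takes the sub-gradient vector `u` as given; unlike the scheme with averaging, it has a single tuning sequence $\gamma_i$ and no temperature.

```python
import math

def eg_step(theta_prev, u, gamma):
    """One exponentiated gradient step: theta_j proportional to theta_prev_j * exp(-gamma * u_j)."""
    w = [t * math.exp(-gamma * uj) for t, uj in zip(theta_prev, u)]
    total = sum(w)
    return [wk / total for wk in w]
```

Iterating `eg_step` from $\theta_0$ reproduces the closed form in which the weights are proportional to $\theta_0^{(j)}$ times the exponential of the negative accumulated sub-gradients.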
Finally, we point out that the algorithm (22) may be deduced from the
ideas mentioned in Subsection 3.3, which are studied in the literature on
proximal methods within the field of convex optimization (see, e.g., ^§1 H]
and the references therein). Namely, under rather general conditions, the
variable $\theta_i$ from (22) is the solution to the minimization problem

$$ \theta_i = \arg\min_{\theta \in \Theta} \big( \theta^T \gamma_i u_i(\theta_{i-1}) + B(\theta, \theta_{i-1}) \big) , \tag{24} $$

where the penalty $B(\theta, \theta_{i-1}) = V(\theta) - V(\theta_{i-1}) - (\theta - \theta_{i-1})^T \nabla V(\theta_{i-1})$
represents the Bregman divergence between $\theta$ and $\theta_{i-1}$ related to the strongly
convex function $V$.

4 Proofs

In this section, we provide technical details leading to the result of Theorem 1.
They will be given in a more general setting than that of Theorem 1 (cf.
Theorem 1 of [22]). Namely, we will consider an arbitrary proxy function
$V$ and use the notations and assumptions of Subsection 3.2. Propositions
2 and 3 below are valid for an arbitrary closed convex set $\Theta$ in $E$, and for
the estimate sequences $(\theta_i)$ and $(\hat\theta_t)$ defined by the algorithm (11)-(12). The
argument up to the relation (29) in the proof of Theorem 1 is valid under
the assumption that $\Theta$ is a convex compact set in $E$.

Introduce the notation

$$ \forall \theta \in \Theta, \quad \nabla A(\theta) = E\, u_i(\theta) , \qquad \xi_i(\theta) = u_i(\theta) - \nabla A(\theta) , $$

where the random functions $u_i(\theta)$ are defined in (4). Note that the mapping
$\theta \mapsto E u_i(\theta)$ belongs to the sub-differential of $A$ (which explains the notation
$\nabla A$). This fact and the inequality $E \|u_i(\theta)\|_\infty^2 \le L_\varphi^2(\lambda)$, valid for all $\theta \in \Theta$,
are the only properties of $u_i$ that will be used in the proofs, other specific
features of definition (4) being of no importance.

Proposition 2 For any $\theta \in \Theta$ and any integer $t \ge 1$ the following inequality
holds:

$$ \sum_{i=1}^{t} \gamma_i (\theta_{i-1} - \theta)^T \nabla A(\theta_{i-1}) \le \beta_t V(\theta) - \beta_0 V(\theta_*) - \sum_{i=1}^{t} \gamma_i (\theta_{i-1} - \theta)^T \xi_i(\theta_{i-1}) + \sum_{i=1}^{t} \frac{\gamma_i^2 \|u_i(\theta_{i-1})\|_\infty^2}{2 \alpha \beta_{i-1}} . $$

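Proof. By continuous differentiability of $W_{\beta_{i-1}}$, we have

$$ W_{\beta_{i-1}}(\zeta_i) = W_{\beta_{i-1}}(\zeta_{i-1}) + \int_0^1 (\zeta_i - \zeta_{i-1})^T \nabla W_{\beta_{i-1}}\big( \tau \zeta_i + (1 - \tau) \zeta_{i-1} \big)\, d\tau . $$

Put $v_i = u_i(\theta_{i-1})$. Then $\zeta_i - \zeta_{i-1} = \gamma_i v_i$, and by Assumption (L)

$$ \begin{aligned}
W_{\beta_{i-1}}(\zeta_i) &= W_{\beta_{i-1}}(\zeta_{i-1}) + \gamma_i v_i^T \nabla W_{\beta_{i-1}}(\zeta_{i-1}) \\
&\quad + \gamma_i \int_0^1 v_i^T \big[ \nabla W_{\beta_{i-1}}\big( \tau \zeta_i + (1 - \tau) \zeta_{i-1} \big) - \nabla W_{\beta_{i-1}}(\zeta_{i-1}) \big]\, d\tau \\
&\le W_{\beta_{i-1}}(\zeta_{i-1}) + \gamma_i v_i^T \nabla W_{\beta_{i-1}}(\zeta_{i-1}) \\
&\quad + \gamma_i \|v_i\|_\infty \int_0^1 \big\| \nabla W_{\beta_{i-1}}\big( \tau \zeta_i + (1 - \tau) \zeta_{i-1} \big) - \nabla W_{\beta_{i-1}}(\zeta_{i-1}) \big\|_1\, d\tau \\
&\le W_{\beta_{i-1}}(\zeta_{i-1}) + \gamma_i v_i^T \nabla W_{\beta_{i-1}}(\zeta_{i-1}) + \frac{\gamma_i^2 \|v_i\|_\infty^2}{2 \alpha \beta_{i-1}} .
\end{aligned} $$

Using the last inequality, the fact that $(\beta_i)_{i \ge 1}$ is a non-decreasing sequence
and that, for $z$ fixed, $\beta \mapsto W_\beta(z)$ is a non-increasing function, we get

$$ W_{\beta_i}(\zeta_i) \le W_{\beta_{i-1}}(\zeta_i) \le W_{\beta_{i-1}}(\zeta_{i-1}) - \gamma_i \theta_{i-1}^T v_i + \frac{\gamma_i^2 \|v_i\|_\infty^2}{2 \alpha \beta_{i-1}} . $$

When summing up, we obtain

$$ \sum_{i=1}^{t} \gamma_i \theta_{i-1}^T v_i \le W_{\beta_0}(\zeta_0) - W_{\beta_t}(\zeta_t) + \sum_{i=1}^{t} \frac{\gamma_i^2 \|v_i\|_\infty^2}{2 \alpha \beta_{i-1}} . $$

Using the representation $\zeta_t = \sum_{i=1}^{t} \gamma_i v_i$, we get that, for any $\theta \in \Theta$,

$$ \sum_{i=1}^{t} \gamma_i (\theta_{i-1} - \theta)^T v_i \le W_{\beta_0}(\zeta_0) - W_{\beta_t}(\zeta_t) - \zeta_t^T \theta + \sum_{i=1}^{t} \frac{\gamma_i^2 \|v_i\|_\infty^2}{2 \alpha \beta_{i-1}} . $$

Finally, since $v_i = \nabla A(\theta_{i-1}) + \xi_i(\theta_{i-1})$, we find

$$ \sum_{i=1}^{t} \gamma_i (\theta_{i-1} - \theta)^T \nabla A(\theta_{i-1}) \le W_{\beta_0}(\zeta_0) - W_{\beta_t}(\zeta_t) - \zeta_t^T \theta - \sum_{i=1}^{t} \gamma_i (\theta_{i-1} - \theta)^T \xi_i(\theta_{i-1}) + \sum_{i=1}^{t} \frac{\gamma_i^2 \|v_i\|_\infty^2}{2 \alpha \beta_{i-1}} . $$

Thus, the desired inequality follows from the fact that

$$ W_{\beta_0}(\zeta_0) = W_{\beta_0}(0) = \beta_0 \sup_{\theta \in \Theta} \{ -V(\theta) \} = -\beta_0 V(\theta_*) $$

and $\beta V(\theta) \ge -W_\beta(\zeta) - \zeta^T \theta$, for all $\zeta \in \mathbb{R}^M$. ■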

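Now we derive the main result of this section.

Proposition 3 For any integer $t \ge 1$, the following inequality holds true:

$$ E A(\hat\theta_t) \le \inf_{\theta \in \Theta} \Big[ A(\theta) + \frac{\beta_t V(\theta) - \beta_0 V(\theta_*)}{\sum_{i=1}^{t} \gamma_i} \Big] + \frac{L_\varphi^2(\lambda)}{\sum_{i=1}^{t} \gamma_i} \sum_{i=1}^{t} \frac{\gamma_i^2}{2 \alpha \beta_{i-1}} . \tag{25} $$

Hence, the expected accuracy of the estimate $\hat\theta_t$ with respect to the $\varphi$-risk
satisfies the following upper bound:

$$ E A(\hat\theta_t) - \min_{\theta \in \Theta} A(\theta) \le \frac{1}{\sum_{i=1}^{t} \gamma_i} \Big( \beta_t V(\theta^*_A) + L_\varphi^2(\lambda) \sum_{i=1}^{t} \frac{\gamma_i^2}{2 \alpha \beta_{i-1}} \Big) , \tag{26} $$

where $\theta^*_A \in \mathrm{Argmin}_{\theta \in \Theta} A(\theta)$.

Proof. For any $\theta \in \Theta$, by convexity of $A$, we get

$$ E A(\hat\theta_t) - A(\theta) \le \frac{\sum_{i=1}^{t} \gamma_i E\big( A(\theta_{i-1}) - A(\theta) \big)}{\sum_{i=1}^{t} \gamma_i} \le \frac{\sum_{i=1}^{t} \gamma_i E\big[ (\theta_{i-1} - \theta)^T \nabla A(\theta_{i-1}) \big]}{\sum_{i=1}^{t} \gamma_i} . \tag{27} $$

Conditioning on $\theta_{i-1}$ and then using both the definition of $\xi_i$ and the
independence between $\theta_{i-1}$ and $(X_i, Y_i)$, we obtain

$$ E\big[ (\theta_{i-1} - \theta)^T \xi_i(\theta_{i-1}) \big] = 0 . $$

We now combine (27) and the inequality of Proposition 2, where we make
use of the bound $E \|u_i(\theta)\|_\infty^2 \le L_\varphi^2(\lambda)$. This leads to (25). Inequality (26) is
straightforward in view of (25). ■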



Remark 4 Note that a simultaneous change of scale in the definition of the
sequences $(\beta_i)$ and $(\gamma_i)$ (i.e., multiplying them by the same positive constant
factor) does not affect the upper bounds in the previous propositions, though
it might affect the estimate sequences $(\theta_i)$ and $(\hat\theta_t)$ of the algorithm (11)-(12).



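Proof of Theorem 1. We have $V(\theta_*) = 0$, and

$$ V(\theta^*_A) \le \max_{\theta \in \Theta} V(\theta) = V^* . $$

Using (26) with the choice $\gamma_i \equiv 1$ and $\beta_i = \beta_0 \sqrt{i + 1}$ for $\beta_0 > 0$, $i \ge 1$, we
get

$$ E A(\hat\theta_t) - A(\theta^*_A) \le \frac{\sqrt{t + 1}}{t} \Big( \beta_0 V^* + \frac{L_\varphi^2(\lambda)}{\alpha \beta_0} \Big) . \tag{28} $$

Optimizing this bound in $\beta_0$ leads to the choice:

$$ \beta_0 = \frac{L_\varphi(\lambda)}{\sqrt{\alpha V^*}} , $$

which gives the bound:

$$ E A(\hat\theta_t) - A(\theta^*_A) \le \frac{2 L_\varphi(\lambda)}{t} \sqrt{\frac{V^*}{\alpha} (t + 1)} . \tag{29} $$

We now recall that $\Theta = \Theta_{M,\lambda}$ and that, for the proxy function $V$ defined
in (8), we have $\alpha = \lambda^{-1}$. Furthermore, this proxy function attains its
maximum at each vertex of the $\lambda$-simplex $\Theta_{M,\lambda}$ and satisfies

$$ V^* = \max_{\theta \in \Theta_{M,\lambda}} V(\theta) = \lambda \ln M . \tag{30} $$

Therefore, the optimal value $\beta_0$ equals $L_\varphi(\lambda) (\ln M)^{-1/2}$. This gives the
accuracy bound as in the statement of Theorem 1. ■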
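The passage from (26) to (28) can be spelled out as follows. This is a sketch which assumes that the schedule (17) is read as also covering the index $0$, so that $\beta_{i-1} = \beta_0 \sqrt{i}$ for all $i \ge 1$; the integral comparison is a standard estimate and is not taken from the text.

```latex
\begin{align*}
\sum_{i=1}^{t} \frac{\gamma_i^2}{2\alpha\beta_{i-1}}
  &= \frac{1}{2\alpha\beta_0} \sum_{i=1}^{t} \frac{1}{\sqrt{i}}
   \le \frac{1}{2\alpha\beta_0} \int_0^t \frac{ds}{\sqrt{s}}
   = \frac{\sqrt{t}}{\alpha\beta_0}
   \le \frac{\sqrt{t+1}}{\alpha\beta_0}, \\
\beta_t V(\theta^*_A) &\le \beta_0 \sqrt{t+1}\, V^*,
   \qquad \sum_{i=1}^{t} \gamma_i = t .
\end{align*}
```

Substituting these estimates into (26) gives the right-hand side of (28). Minimizing $\beta_0 V^* + L_\varphi^2(\lambda)/(\alpha\beta_0)$ over $\beta_0 > 0$ balances the two terms, which yields $\beta_0 = L_\varphi(\lambda)/\sqrt{\alpha V^*}$ and the value $2 L_\varphi(\lambda) \sqrt{V^*/\alpha}$ for the bracket.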
5 Extension

Theorem 1 can easily be extended to a more general framework. Inspection
of the proof indicates that it does not use a specific form of the loss function
or of the proxy function. The required properties of these functions are
summarized at the beginning of Section 4. We now state a more general
result. First, introduce some notation.

Consider a random variable $Z$ which takes its values in a set $\mathcal{Z}$. The
decision set $\Theta$ is supposed to be a convex and closed set in $\mathbb{R}^M$, and we consider a loss
function $Q : \Theta \times \mathcal{Z} \to \mathbb{R}_+$ such that the random function $Q(\cdot, Z) : \Theta \to \mathbb{R}_+$
is convex almost surely. Define the convex risk function $A : \Theta \to \mathbb{R}_+$ as
follows:

$$ A(\theta) = E\, Q(\theta, Z) . $$

The training sample is given in the form of an i.i.d. sequence $(Z_1, \dots, Z_{t-1})$,
where each $Z_i$ has the same distribution as $Z$. Our aim now consists in
criterial minimization of $A$ over $\Theta$ (see, e.g., j2S|), which means that we
characterize the accuracy of the estimate $\hat\theta_t = \hat\theta_t(Z_1, \dots, Z_{t-1}) \in \Theta$ minimizing
$A$, by the difference:

$$ E A(\hat\theta_t) - \min_{\theta \in \Theta} A(\theta) $$

(we assume that $\min_{\theta \in \Theta} A(\theta)$ is attainable). We denote by

$$ u_i(\theta) = \nabla_\theta Q(\theta, Z_i) , \quad i = 1, 2, \dots, \tag{31} $$

the stochastic sub-gradients, which are measurable functions defined on $\Theta \times \mathcal{Z}$
such that, for any $\theta \in \Theta$, their expectation $E u_i(\theta)$ belongs to the
sub-differential of the function $A(\theta)$.



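Theorem 2 Let $\Theta$ be a convex closed set in $\mathbb{R}^M$, and $Q$ be a loss function
which meets the conditions mentioned above. Moreover, assume that

$$\sup_{\theta\in\Theta} E\,\|\nabla_\theta Q(\theta, Z)\|_\infty^2 \le L_{\Theta,Q}^2, \qquad (32)$$

where $L_{\Theta,Q}$ is a finite constant. Let $V$ be a proxy function on $\Theta$ satisfying
Assumption (L) with a parameter $\alpha > 0$, and assume that there exists
$\theta_A^* \in \operatorname{Argmin}_{\theta\in\Theta} A(\theta)$. Then, for any integer $t \ge 1$, the estimate $\hat\theta_t$ defined in
Subsection 3.?? with stochastic sub-gradients (31) and with sequences $(\gamma_i)_{i\ge1}$
and $(\beta_i)_{i\ge1}$ from (17) with arbitrary $\beta_0 > 0$ satisfies the following inequality:

$$E\,A(\hat\theta_t) - \min_{\theta\in\Theta} A(\theta) \le \left(\beta_0 V(\theta_A^*) + \frac{L_{\Theta,Q}^2}{\alpha\beta_0}\right)\frac{\sqrt{t+1}}{t}. \qquad (33)$$

Furthermore, if $\bar V$ is a constant such that $V(\theta_A^*) \le \bar V$ and we set
$\beta_0 = L_{\Theta,Q}(\alpha\bar V)^{-1/2}$, then

$$E\,A(\hat\theta_t) - \min_{\theta\in\Theta} A(\theta) \le 2L_{\Theta,Q}\,(\alpha^{-1}\bar V)^{1/2}\,\frac{\sqrt{t+1}}{t}. \qquad (34)$$

In particular, if $\Theta$ is a convex compact set, we can take $\bar V = \max_{\theta\in\Theta} V(\theta)$.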
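
The choice of $\beta_0$ in Theorem 2 can be checked numerically. The sketch below assumes that the part of the bound (33) which does not depend on $t$ has the form $\beta_0 \bar V + L_{\Theta,Q}^2/(\alpha\beta_0)$; the function names and the sample values of $L_{\Theta,Q}$, $\alpha$ and $\bar V$ are illustrative and do not come from the paper.

```python
import math

def bound_factor(beta0, L, alpha, V_bar):
    """Assumed t-independent factor of (33): beta0 * V_bar + L**2 / (alpha * beta0)."""
    return beta0 * V_bar + L ** 2 / (alpha * beta0)

def optimal_beta0(L, alpha, V_bar):
    """Choice made in Theorem 2: beta0 = L * (alpha * V_bar) ** (-1/2)."""
    return L / math.sqrt(alpha * V_bar)

# Illustrative values only (assumptions, not taken from the paper).
L, alpha, V_bar = 3.0, 0.5, 2.0
beta_star = optimal_beta0(L, alpha, V_bar)
best = bound_factor(beta_star, L, alpha, V_bar)
closed_form = 2.0 * L * math.sqrt(V_bar / alpha)  # constant appearing in (34)

# Scan a grid around beta_star: no grid point should give a smaller factor.
grid = [beta_star * (0.2 + 0.01 * k) for k in range(400)]
grid_min = min(bound_factor(b, L, alpha, V_bar) for b in grid)
print(best, closed_form, grid_min)
```

Under this reading, the value at `beta_star` coincides with the constant $2L_{\Theta,Q}(\alpha^{-1}\bar V)^{1/2}$, and no other value of $\beta_0$ on the grid does better.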

This theorem follows from the proofs of Section ?? (cf. (??), (??), and
(??)), where $L_{\Theta,Q}$ should replace the constant $L_\varphi(\lambda)$. It generalizes Theorem
1 and encompasses different statistical models, including the one described in
Section ??, where $Z$ plays the role of the pair of variables $(X, Y)$, $\Theta = \Theta_{M,\lambda}$,
and $Q(\theta, Z) = \varphi(Y\theta^T H(X))$. In the same way, Theorem 2 is also applicable
to the standard regression model with squared loss $Q(\theta, Z) = (Y - \theta^T H(X))^2$,
in which case a similar result has been proved for another method in [15].
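
To make the setting of Theorem 2 concrete, here is a minimal sketch for the regression model with squared loss on the simplex $\Theta_{M,\lambda}$. The algorithm itself is defined outside this section, so the sketch rests on stated assumptions: the recursion is taken to be $\zeta_i = \zeta_{i-1} + \gamma_i u_i(\theta_{i-1})$, $\theta_i = -\nabla W_{\beta_i}(\zeta_i)$, with output equal to the average of $\theta_0, \dots, \theta_{t-1}$; the sequences are taken as $\gamma_i \equiv 1$, $\beta_i = \beta_0\sqrt{i+1}$; and $-\nabla W_\beta(z)$ is taken to be $\lambda$ times the Gibbs weights of $-z/\beta$, as for the entropy type proxy function. The data generator, the choice $H(X) = X$ and all function names are hypothetical.

```python
import math
import random

def gibbs_weights(z, beta, lam):
    """lam * softmax(-z / beta): assumed form of -grad W_beta(z) for the entropy proxy."""
    m = min(z)                                   # shift for numerical stability
    w = [math.exp(-(zk - m) / beta) for zk in z]
    s = sum(w)
    return [lam * wk / s for wk in w]

def squared_loss_subgradient(theta, h, y):
    """Gradient in theta of Q(theta, Z) = (Y - theta^T H(X))^2, with h = H(X)."""
    resid = y - sum(tj * hj for tj, hj in zip(theta, h))
    return [-2.0 * resid * hj for hj in h]

def mirror_descent_average(sample, M, lam, beta0):
    """Assumed recursion with gamma_i = 1 and beta_i = beta0 * sqrt(i + 1)."""
    theta = [lam / M] * M        # theta_0: centre of the simplex
    zeta = [0.0] * M             # dual variable zeta_0 = 0
    total = list(theta)          # running sum theta_0 + ... + theta_{t-1}
    for i, (h, y) in enumerate(sample, start=1):
        u = squared_loss_subgradient(theta, h, y)            # u_i(theta_{i-1})
        zeta = [zk + uk for zk, uk in zip(zeta, u)]          # dual-space step
        theta = gibbs_weights(zeta, beta0 * math.sqrt(i + 1), lam)
        total = [ak + tk for ak, tk in zip(total, theta)]
    t = len(sample) + 1          # t - 1 observations give t iterates
    return [ak / t for ak in total]

# Hypothetical synthetic data: Y = target^T X + noise, with H(X) = X.
random.seed(0)
M, lam = 5, 1.0
target = [0.7, 0.3, 0.0, 0.0, 0.0]
sample = []
for _ in range(2000):
    h = [random.uniform(-1.0, 1.0) for _ in range(M)]
    y = sum(a * b for a, b in zip(target, h)) + random.gauss(0.0, 0.1)
    sample.append((h, y))
theta_hat = mirror_descent_average(sample, M, lam, beta0=1.0)
print(theta_hat)
```

Each observation is used once and only the vectors `zeta` and `total` are stored, which is what makes a recursion of this type suitable for the online setting.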

Remark 5 (dependent data). Inspection of the proofs shows that Theorem 2
can be easily extended to the case of dependent data $Z_i$. In fact,
instead of assuming that the $Z_i$ are i.i.d., it suffices to assume that they form
a stationary sequence, where each $Z_i$ has the same distribution as $Z$. Then
Theorem 2 remains valid if we additionally assume that the conditional
expectation satisfies $E\big(u_i(\theta_{i-1}) \mid \theta_{i-1}\big) \in \partial A(\theta_{i-1})$ a.s.

6 Discussion 

To conclude, we discuss further the choices of the proxy function $V$, of the
parametric set $\Theta$, and of the sequences $(\beta_i)_{i\ge1}$, $(\gamma_i)_{i\ge1}$.

6.1 Choice of the proxy function V 

The entropic proxy function defined in (??) is not the only possible
choice. A key condition on $V$ is strong convexity with respect to the
norm $\|\cdot\|_1$, which guarantees that Assumption (L) holds true. Therefore,
one may also consider other proxy functions satisfying this condition, such
as, for example,

$$\forall\,\theta \in \mathbb{R}^M, \quad V(\theta) = \frac12\|\theta\|_p^2 = \frac12\Big(\sum_{j=1}^M |\theta^{(j)}|^p\Big)^{2/p}, \qquad (35)$$

where $p = 1 + 1/\ln M$ (see [??]). In contrast with the function (??), the proxy
function (35) can be used when $\Theta$ is any convex and closed set in $\mathbb{R}^M$.
For the simplex $\Theta_{M,\lambda}$, one may consider functions of the form

$$\forall\,\theta \in \Theta_{M,\lambda}, \quad V(\theta) = C_0 + C_1\sum_{j=1}^M (\theta^{(j)})^{s+1}, \quad s = \frac{1}{\ln M}, \qquad (36)$$

where the constants $C_0 = -\lambda^2/(e\,s(s+1))$, $C_1 = \lambda^{1-s}/(s(s+1))$ are adjusted
in order to have $\min_{\theta\in\Theta_{M,\lambda}} V(\theta) = 0$. It is easy to see that the proxy function
defined in (36) is $\alpha$-strongly convex in the norm $\|\cdot\|_1$. When $\lambda = 1$, this
proxy function is a particular case of the $f$-divergence of Csiszár (see the
definition in [34]) between the uniform distribution on the set $\{1, \dots, M\}$ and
the distribution on the same set defined by the probabilities $\theta^{(j)}$. Recall that for
$\lambda = 1$ the proxy function defined in (??) equals the Kullback divergence
between these distributions. Presumably, other proxy functions can be based
on some properly chosen $f$-divergences of Csiszár.
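
The normalization of the constants in (36) can be verified directly. The sketch below assumes the reading $V(\theta) = C_0 + C_1\sum_j (\theta^{(j)})^{s+1}$ with $s = 1/\ln M$, for which $M^{-s} = 1/e$; the function name and the sample values of $M$ and $\lambda$ are illustrative.

```python
import math

def power_proxy(theta, lam):
    """V(theta) = C0 + C1 * sum_j theta_j**(s+1) with s = 1/ln M, as read from (36)."""
    M = len(theta)
    s = 1.0 / math.log(M)
    c0 = -lam ** 2 / (math.e * s * (s + 1.0))
    c1 = lam ** (1.0 - s) / (s * (s + 1.0))
    return c0 + c1 * sum(t ** (s + 1.0) for t in theta)

M, lam = 10, 2.0
uniform = [lam / M] * M             # centre of the simplex, where the minimum is attained
vertex = [lam] + [0.0] * (M - 1)    # a vertex, where a convex function is maximal
value_at_centre = power_proxy(uniform, lam)
value_at_vertex = power_proxy(vertex, lam)
print(value_at_centre, value_at_vertex)
```

The value at the centre vanishes up to rounding, and the value at a vertex equals $\lambda^2(1 - 1/e)/(s(s+1))$, which grows like $\lambda^2 \ln M$.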

On the other hand, if a proxy function $V$ is such that the gradient $\nabla W_\beta$ of
its $\beta$-conjugate cannot be explicitly written, the numerical implementation
of our algorithm might become time-consuming.

From the upper bound (34) we can see that an important characteristic
of $V$ is the ratio $\bar V/\alpha$ (or $\max_{\theta\in\Theta} V(\theta)/\alpha$ if the set $\Theta$ is bounded), and
thus one can look for optimal proxy functions minimizing this ratio. We
conjecture that for $\Theta = \Theta_{M,\lambda}$ such an optimal proxy function is the entropy
type function given in (??); however, we do not have a rigorous proof of this
fact. For the latter, we have

$$\frac{1}{\alpha}\max_{\theta\in\Theta_{M,\lambda}} V(\theta) = \lambda^2 \ln M.$$

For other proxy functions, this ratio is of the same order. For instance, it is
proved in [??] (Lemma 6.1) that the proxy function defined in (35) satisfies

$$\frac{1}{\alpha}\max_{\theta\in\Theta_{M,\lambda}} V(\theta) = O(1)\,\lambda^2 \ln M.$$

This relation is true for the proxy function defined in (36) as well.

Finally, note that a widely used penalty function such as $\|\cdot\|_1$ is not a proxy
function in the sense of Definition 2, as it is not strongly convex with respect
to $\|\cdot\|_1$. Another frequently used penalty function, $V(\theta) = \|\theta\|_2^2$, is strongly
convex. However, it can be easily verified that its "performance ratio" is
extremely bad for large $M$: this function $V$ satisfies

$$\frac{1}{\alpha}\max_{\theta\in\Theta_{M,\lambda}} V(\theta) = \frac12\lambda^2 M.$$
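
The gap between the two performance ratios is easy to tabulate. The sketch below assumes that the entropy type function has the form $V(\theta) = \lambda\ln(M/\lambda) + \sum_j \theta^{(j)}\ln\theta^{(j)}$ with $\alpha = 1/\lambda$, and that the quadratic penalty is $\|\theta\|_2^2$, for which $\alpha = 2/M$ with respect to $\|\cdot\|_1$ (since $\|h\|_1^2 \le M\|h\|_2^2$). Both maxima are attained at a vertex of the simplex; the function names are illustrative.

```python
import math

def entropy_proxy(theta, lam):
    """Assumed form: lam * ln(M / lam) + sum_j theta_j ln theta_j, with 0 ln 0 = 0."""
    M = len(theta)
    return lam * math.log(M / lam) + sum(t * math.log(t) for t in theta if t > 0.0)

def entropy_ratio(M, lam):
    """(1 / alpha) * max V with alpha = 1 / lam; the maximum is attained at a vertex."""
    vertex = [lam] + [0.0] * (M - 1)
    return lam * entropy_proxy(vertex, lam)

def euclidean_ratio(M, lam):
    """Same ratio for V = ||theta||_2^2, assuming alpha = 2 / M w.r.t. the l1 norm."""
    max_v = lam ** 2               # value of ||theta||_2^2 at a vertex
    return max_v * M / 2.0

for M in (10, 1000, 100000):
    print(M, entropy_ratio(M, 1.0), euclidean_ratio(M, 1.0))
```

The first ratio grows logarithmically in $M$ and the second linearly, which is the sense in which the quadratic penalty performs badly for large $M$.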

6.2 Other parametric sets $\Theta$

Theorem 2 holds for any convex closed set $\Theta$ contained in $\mathbb{R}^M$. However, for
general sets $\Theta$, the gradient $\nabla W_\beta$ cannot be computed explicitly and the
computational effort of implementing an iteration of the algorithm can become
prohibitive. Hence, it is important to consider only the sets $\Theta$ for which
the solution $\theta^*(z) = -\nabla W_\beta(z)$ of the optimization problem (??) can be
easily computed. Some examples of such "simple" sets are: (i) the
$\lambda$-simplex $\Theta_{M,\lambda}$, (ii) the full-dimensional $\lambda$-simplex $\{\theta \in \mathbb{R}_+^M : \|\theta\|_1 \le \lambda\}$, and
(iii) the symmetrized version of the latter, that is, the hyper-octahedron
$\{\theta \in \mathbb{R}^M : \|\theta\|_1 \le \lambda\}$.
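
For example (i), the solution is available in closed form when $V$ is the entropy type function. The sketch below assumes $V(\theta) = \lambda\ln(M/\lambda) + \sum_j \theta^{(j)}\ln\theta^{(j)}$ and that the optimization problem is the minimization of $z^T\theta + \beta V(\theta)$ over $\Theta_{M,\lambda}$, in which case the minimizer is $\lambda$ times the Gibbs weights of $-z/\beta$. The function names and the random test points are illustrative.

```python
import math
import random

def entropy_proxy(theta, lam):
    """Assumed form: lam * ln(M / lam) + sum_j theta_j ln theta_j, with 0 ln 0 = 0."""
    M = len(theta)
    return lam * math.log(M / lam) + sum(t * math.log(t) for t in theta if t > 0.0)

def simplex_solution(z, beta, lam):
    """theta*(z) = lam * softmax(-z / beta), computed with a stabilizing shift."""
    m = min(z)
    w = [math.exp(-(zk - m) / beta) for zk in z]
    s = sum(w)
    return [lam * wk / s for wk in w]

def objective(theta, z, beta, lam):
    """z^T theta + beta * V(theta), the function minimized over the simplex."""
    return sum(zk * tk for zk, tk in zip(z, theta)) + beta * entropy_proxy(theta, lam)

def random_simplex_point(M, lam, rng):
    """Random point of the lam-simplex obtained by normalizing exponential variables."""
    e = [rng.expovariate(1.0) for _ in range(M)]
    s = sum(e)
    return [lam * ek / s for ek in e]

rng = random.Random(1)
M, lam, beta = 6, 2.0, 0.7
z = [rng.uniform(-3.0, 3.0) for _ in range(M)]
theta_star = simplex_solution(z, beta, lam)
f_star = objective(theta_star, z, beta, lam)
# Smallest gap between the objective at random feasible points and at theta_star.
worst_gap = min(objective(random_simplex_point(M, lam, rng), z, beta, lam) - f_star
                for _ in range(500))
print(theta_star, worst_gap)
```

Computing `theta_star` costs $O(M)$ operations, which is what makes the simplex a "simple" set in the above sense.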

6.3 Choice of the step size and temperature parameters

The constant factor in the bound of Theorem 1 can be only slightly improved.
The sequences $(\beta_i)$ and $(\gamma_i)$ as described in (17) are close to optimal in
the sense of the upper bound (??). Indeed, if we further bound $V(\theta_A^*)$ by
$V^* = \max_{\theta\in\Theta} V(\theta)$ in (??) and minimize over $(\gamma_i)$ and $(\beta_i)$ under the monotonicity
condition $\beta_i \ge \beta_{i-1}$, we get that the minimum is attained for sequences $(\gamma_i)$
and $(\beta_i)$ which are independent of $i$ and such that $\beta_i/\gamma_i = L_\varphi(\lambda)\sqrt{t/(2\alpha V^*)}$.
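We can take, for instance,

$$\gamma_i \equiv 1, \qquad \beta_i \equiv L_\varphi(\lambda)\sqrt{\frac{t}{2\ln M}}, \qquad (37)$$

which leads to a better constant than the bound (??) in Theorem 1: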

$$E\,A(\hat\theta_t) - \min_{\theta\in\Theta_{M,\lambda}} A(\theta) \le \lambda L_\varphi(\lambda)\sqrt{\frac{2\ln M}{t}}. \qquad (38)$$

Thus, we can improve the constant factor in the upper bound from 2 to $\sqrt2$.
However, in order to make this improvement, one needs to know the sample
size $t$ in advance, and this is not compatible with the online framework.
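
The size of this improvement can be illustrated numerically. The sketch below assumes that the online bound of Theorem 1 is (34) specialized to $\Theta_{M,\lambda}$ with the entropy type proxy function, that is, $2\lambda L_\varphi(\lambda)\sqrt{\ln M}\,\sqrt{t+1}/t$, and compares it with the fixed-horizon bound (38). The function names and the values of $M$, $\lambda$ and $L_\varphi(\lambda)$ are illustrative.

```python
import math

def online_bound(t, M, lam, L):
    """Assumed form of the Theorem 1 bound: 2 * lam * L * sqrt(ln M) * sqrt(t + 1) / t."""
    return 2.0 * lam * L * math.sqrt(math.log(M)) * math.sqrt(t + 1) / t

def fixed_horizon_bound(t, M, lam, L):
    """Bound (38): lam * L * sqrt(2 ln M / t); requires t to be known in advance."""
    return lam * L * math.sqrt(2.0 * math.log(M) / t)

def ratio(t, M=100, lam=1.0, L=1.0):
    """Ratio of the two bounds; it does not depend on M, lam or L."""
    return online_bound(t, M, lam, L) / fixed_horizon_bound(t, M, lam, L)

for t in (10, 1000, 100000):
    print(t, ratio(t))
```

The ratio equals $\sqrt{2}\,\sqrt{1 + 1/t}$, so the price paid for not knowing $t$ in advance is a factor which tends to $\sqrt2$.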



Appendix 

In this appendix, we propose two different proofs of the fact that Assumption
(L) holds for the $\beta$-conjugate $W_\beta$ of the entropy type function $V$ given by
(??). First, we give a straightforward argument using the equations from (??).
The second proof is based on a generic argument which exploits the convexity
properties of the function $V$ rather than its particular expression.



A. Direct proof 

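Evidently, the function $W_\beta$ in (??) is twice continuously differentiable on
$E^* = \mathbb{R}^M$. Set

$$L = \sup_{z_1 \ne z_2} \frac{\|\nabla W_\beta(z_1) - \nabla W_\beta(z_2)\|_1}{\|z_1 - z_2\|_\infty} \le \sup_{\|x\|_\infty \le 1,\ \|y\|_\infty \le 1}\ \sup_{z\in E^*} x^T \nabla^2 W_\beta(z)\, y,$$

where the second derivative matrix $\nabla^2 W_\beta(z)$ has the entries

$$\frac{\partial^2 W_\beta(z)}{\partial z_i\,\partial z_j} = \frac{\lambda}{\beta}\left(\frac{e^{-z_i/\beta}\,\delta_{ij}}{\sum_k e^{-z_k/\beta}} - \frac{e^{-z_i/\beta}\,e^{-z_j/\beta}}{\big(\sum_k e^{-z_k/\beta}\big)^2}\right).$$

Here $\delta_{ij}$ stands for the Kronecker symbol. Denote $a_i = e^{-z_i/\beta} / \sum_k e^{-z_k/\beta}$,
which are evidently positive with $\sum_i a_i = 1$. Now,

$$\frac{\beta}{\lambda}\, x^T \nabla^2 W_\beta(z)\, y = \sum_i x_i y_i a_i - \sum_i a_i x_i \sum_j a_j y_j = \sum_i x_i a_i \Big(y_i - \sum_j a_j y_j\Big)$$

$$= \sum_i x_i a_i \sum_j a_j (y_i - y_j) \le \sum_i \sum_j a_i a_j |y_i - y_j|. \qquad (39)$$

Finally, the latter sum is bounded by 1 for any $|y_i| \le 1$ and $a_i > 0$, $\sum_i a_i = 1$.
To see this, note that the maximum of the convex (in $y \in \mathbb{R}^M$) function
on the right-hand side of (39) over the convex set $\{y \in \mathbb{R}^M : \|y\|_\infty \le 1\}$ is
always attained at an extreme point of the set, that is, at a vertex of
the hypercube $\{y \in \mathbb{R}^M : y_i = \pm 1,\ i = 1, \dots, M\}$. Denote this extreme
point by $y^* = (y_1^*, \dots, y_M^*)^T$. Let us split the index set $\{1, \dots, M\}$ into
$J_+ = \{i : y_i^* = 1\}$ and $J_- = \{i : y_i^* = -1\}$. Then the maximal value of the sum
can be decomposed, and we get

$$\frac{\beta}{\lambda}\, L \le 2\sum_{i\in J_+,\, j\in J_-} a_i a_j + 2\sum_{i\in J_-,\, j\in J_+} a_i a_j = 4\sum_{i\in J_+} a_i \sum_{j\in J_-} a_j = 4a_+(1 - a_+) \le 1,$$

where $a_+ = \sum_{i\in J_+} a_i$. Thus $L \le \lambda/\beta$, so that Assumption (L) holds with
$\alpha = 1/\lambda$, which is independent of $\beta$.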
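
The bound $L \le \lambda/\beta$ can be checked by brute force for small $M$. The sketch below assumes that the Hessian of $W_\beta$ is $(\lambda/\beta)(\mathrm{diag}(a) - aa^T)$ with $a$ the Gibbs weights of $-z/\beta$, and evaluates the supremum of $x^T\nabla^2W_\beta(z)\,y$ over the two unit cubes by enumerating the sign vectors $y$; for fixed $y$ the supremum over $x$ is the $\ell_1$ norm of $\nabla^2W_\beta(z)\,y$. The function names and the random points $z$ are illustrative.

```python
import itertools
import math
import random

def gibbs(z, beta):
    """a_i = exp(-z_i / beta) / sum_k exp(-z_k / beta), with a stabilizing shift."""
    m = min(z)
    w = [math.exp(-(zk - m) / beta) for zk in z]
    s = sum(w)
    return [wk / s for wk in w]

def hessian(z, beta, lam):
    """(lam / beta) * (diag(a) - a a^T) with a = gibbs(z, beta)."""
    a = gibbs(z, beta)
    M = len(z)
    return [[lam / beta * ((a[i] if i == j else 0.0) - a[i] * a[j])
             for j in range(M)] for i in range(M)]

def cube_bilinear_sup(H):
    """sup of x^T H y over ||x||_inf <= 1, ||y||_inf <= 1, by enumerating vertices y."""
    M = len(H)
    best = 0.0
    for y in itertools.product((-1.0, 1.0), repeat=M):
        hy = [sum(H[i][j] * y[j] for j in range(M)) for i in range(M)]
        best = max(best, sum(abs(v) for v in hy))   # sup over x equals ||H y||_1
    return best

rng = random.Random(2)
lam, beta, M = 2.0, 0.5, 6
sups = [cube_bilinear_sup(hessian([rng.uniform(-2.0, 2.0) for _ in range(M)], beta, lam))
        for _ in range(20)]
tight = cube_bilinear_sup(hessian([0.0] * 4, beta, lam))   # uniform weights, a_+ = 1/2
print(max(sups), tight, lam / beta)
```

For $z = 0$ and even $M$ the weights are uniform, one can take $a_+ = 1/2$, and the supremum equals $\lambda/\beta$, so the constant cannot be improved.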



B. General argument 

Consider first a more general setting, where $E$ is the space $\mathbb{R}^M$ equipped with
some norm $\|\cdot\|$ (primal space), and $E^*$ is $\mathbb{R}^M$ equipped with the corresponding
dual norm $\|\cdot\|_*$ (dual space). Let now $V : \Theta \to \mathbb{R}$ be an $\alpha$-strongly convex
function with respect to the primal norm $\|\cdot\|$ on a convex closed set $\Theta \subset \mathbb{R}^M$.
It is easy to show that inequality (7) (holding for all $x, y \in \Theta$ and for any
$s \in [0, 1]$) implies

$$V(x) \ge V(x^*) + \frac{\alpha}{2}\|x - x^*\|^2, \quad \forall x \in \Theta, \qquad (40)$$

where $x^* = \arg\min_{x\in\Theta} V(x)$. Indeed, the existence and uniqueness of the
minimizer $x^*$ is evident. Now for any $x \in \Theta$ and $s \in (0, 1)$,

$$sV(x) + (1-s)V(x^*) \ge V(sx + (1-s)x^*) + \frac{\alpha}{2}s(1-s)\|x - x^*\|^2 \ge V(x^*) + \frac{\alpha}{2}s(1-s)\|x - x^*\|^2,$$

and we get (40) by subtracting $V(x^*)$, dividing by $s$, and then letting $s$ tend
to 0 (as $V$ is continuous on $\Theta$).

We assume, furthermore, that the $\beta$-conjugate $W_\beta$ defined by (??) is
continuously differentiable on $E^*$ and that assertion 2(ii) of Proposition 1 holds
true. Let us fix any points $z_1, z_2 \in E^*$ and arbitrary $s \in (0, 1)$. Denote

$$x_1 = -\nabla W_\beta(z_1), \qquad x_2 = -\nabla W_\beta(z_2),$$

and $x_s = sx_1 + (1-s)x_2$. Recall that, due to assertion 2(ii) of Proposition
1, we have $x_k = \arg\min_{\theta\in\Theta}\{z_k^T\theta + \beta V(\theta)\}$, $k = 1, 2$.
Now we are done since the function $z^Tx + \beta V(x)$ is $(\alpha\beta)$-strongly convex
for any fixed $z$; hence, by (40),

$$-z_1^Tx_s - \beta V(x_s) \le -z_1^Tx_1 - \beta V(x_1) - \frac{\alpha\beta}{2}\|x_s - x_1\|^2,$$

$$-z_2^Tx_s - \beta V(x_s) \le -z_2^Tx_2 - \beta V(x_2) - \frac{\alpha\beta}{2}\|x_s - x_2\|^2,$$

and summing up with the coefficients $s$ and $1-s$ we get, by definition of $x_s$,

$$s(1-s)(z_1 - z_2)^T(x_1 - x_2) = s\,z_1^T(x_1 - x_s) + (1-s)\,z_2^T(x_2 - x_s)$$

$$\le \beta\Big(V(x_s) - sV(x_1) - (1-s)V(x_2) - \frac{\alpha}{2}s(1-s)\|x_1 - x_2\|^2\Big)$$

$$\le -\alpha\beta\,s(1-s)\|x_1 - x_2\|^2.$$

Therefore,

$$\alpha\beta\|x_2 - x_1\|^2 \le (z_1 - z_2)^T(x_2 - x_1) \le \|z_1 - z_2\|_*\,\|x_1 - x_2\|,$$

and it implies, both for $x_1 = x_2$ and for $x_1 \ne x_2$,

$$\|\nabla W_\beta(z_1) - \nabla W_\beta(z_2)\| \le \frac{1}{\alpha\beta}\|z_1 - z_2\|_*,$$

which is the desired Lipschitz property for $\nabla W_\beta$ in assertion 2(i) of
Proposition 1.
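
The Lipschitz property can be illustrated numerically in the entropy case, where $\|\cdot\| = \|\cdot\|_1$, $\|\cdot\|_* = \|\cdot\|_\infty$ and $\alpha = 1/\lambda$. The sketch below assumes that $\nabla W_\beta(z)$ equals $-\lambda$ times the Gibbs weights of $-z/\beta$; the function names and the random points are illustrative.

```python
import math
import random

def grad_W(z, beta, lam):
    """Assumed gradient of W_beta for the entropy proxy: -lam * softmax(-z / beta)."""
    m = min(z)
    w = [math.exp(-(zk - m) / beta) for zk in z]
    s = sum(w)
    return [-lam * wk / s for wk in w]

def lipschitz_ratio(z1, z2, beta, lam):
    """||grad W(z1) - grad W(z2)||_1 / ||z1 - z2||_inf for two distinct points."""
    g1, g2 = grad_W(z1, beta, lam), grad_W(z2, beta, lam)
    num = sum(abs(a - b) for a, b in zip(g1, g2))   # l1 norm of the gradient difference
    den = max(abs(a - b) for a, b in zip(z1, z2))   # l_inf norm of z1 - z2
    return num / den

rng = random.Random(3)
lam, beta, M = 2.0, 0.5, 8
ratios = []
for _ in range(1000):
    z1 = [rng.uniform(-2.0, 2.0) for _ in range(M)]
    z2 = [zk + rng.uniform(-0.5, 0.5) for zk in z1]
    ratios.append(lipschitz_ratio(z1, z2, beta, lam))
print(max(ratios), lam / beta)
```

All observed ratios stay below $1/(\alpha\beta) = \lambda/\beta$, in agreement with the inequality above.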

Now we return to the particular case where $\|\cdot\| = \|\cdot\|_1$, $\|\cdot\|_* = \|\cdot\|_\infty$,
and $V$ is the entropy type proxy function defined in (??). We prove that $V$
is $(1/\lambda)$-strongly convex with respect to the norm $\|\cdot\|_1$, i.e., it satisfies (7)
with $\alpha = 1/\lambda$.

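Proof of (7). Observe that the function $V$ defined in (??) is twice continuously
differentiable at any point $x = (x_1, \dots, x_M)^T$ inside the set $\Theta_{M,\lambda}$, with

$$\frac{\partial^2 V(x)}{\partial x_i\,\partial x_j} = \frac{\delta_{ij}}{x_j}, \quad i, j = 1, \dots, M.$$

Let us fix two arbitrary points $x, y$ inside the set $\Theta_{M,\lambda}$. One may write, for
some interior point $\tilde x \in \Theta_{M,\lambda}$,

$$\langle \nabla V(x) - \nabla V(y),\, x - y\rangle = (x - y)^T\nabla^2 V(\tilde x)(x - y) = \sum_{j=1}^M \frac{(x_j - y_j)^2}{\tilde x_j} \qquad (41)$$

$$= \lambda\sum_{j=1}^M \frac{\tilde x_j}{\lambda}\left(\frac{|x_j - y_j|}{\tilde x_j}\right)^2 \ge \lambda\left(\sum_{j=1}^M \frac{\tilde x_j}{\lambda}\,\frac{|x_j - y_j|}{\tilde x_j}\right)^2 \qquad (42)$$

$$= \frac{1}{\lambda}\|x - y\|_1^2, \qquad (43)$$

where we used Jensen's inequality in (42) since all $\tilde x_j > 0$ and
$\sum_{i=1}^M \tilde x_i = \lambda$.

By the standard argument (see, e.g., [??]), for all interior points $x, y$ of $\Theta_{M,\lambda}$
we get (7) from (41)-(43). Finally, by continuity of $V$ on $\Theta_{M,\lambda}$, (7) extends
to all $x, y$ in $\Theta_{M,\lambda}$. ∎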
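
The key inequality of this proof, $\langle\nabla V(x) - \nabla V(y), x - y\rangle \ge \lambda^{-1}\|x - y\|_1^2$, can also be checked on random interior points of the simplex. The sketch below assumes that the entropy type function has the gradient components $\ln x_j + 1$, that is, $V(x) = \lambda\ln(M/\lambda) + \sum_j x_j\ln x_j$; the function names and the sampling scheme are illustrative.

```python
import math
import random

def grad_entropy(x):
    """Gradient of sum_j x_j ln x_j at an interior point: components ln x_j + 1."""
    return [math.log(xj) + 1.0 for xj in x]

def monotonicity_gap(x, y, lam):
    """<grad V(x) - grad V(y), x - y> - ||x - y||_1^2 / lam; nonnegative if (7) holds."""
    gx, gy = grad_entropy(x), grad_entropy(y)
    inner = sum((a - b) * (u - v) for a, b, u, v in zip(gx, gy, x, y))
    l1 = sum(abs(u - v) for u, v in zip(x, y))
    return inner - l1 ** 2 / lam

def random_interior_point(M, lam, rng):
    """Random interior point of the lam-simplex (normalized exponential variables)."""
    e = [rng.expovariate(1.0) for _ in range(M)]
    s = sum(e)
    return [lam * ek / s for ek in e]

rng = random.Random(4)
M, lam = 7, 3.0
gaps = [monotonicity_gap(random_interior_point(M, lam, rng),
                         random_interior_point(M, lam, rng), lam)
        for _ in range(1000)]
print(min(gaps))
```

For $\lambda = 1$ the left-hand side is the symmetrized Kullback divergence between $x$ and $y$, so the inequality is then a consequence of Pinsker's inequality.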